Positions and intervals provide addresses for anything that can be mapped to a reference genome:
It is no wonder that positions and intervals are key fields in most genomic data files.
For example, next-generation sequencing (NGS) technologies can map all kinds of functional biomolecular information along this coordinate system.
Show a grey line - genome - with some genes, orange intervals - like on the figure below Then show another track - let’s say H3K4me3 (just say that that you can do experiment where you can measure gene activity or something like that) And then pose a question - we would like to find a subset of active genes in this cell line or whatever Therefore - OVERLAP ! Then you say that this is the most basic/fundamental operation in genomics
The core objects in bioframe are pandas DataFrames of genomic intervals, or BedFrames defined by chrom, start and end columns (the names are configurable). We don't use opaque wrapper objects or DataFrame subclasses.
import bioframe as bf
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)
display(df1)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
plt.title('bedFrame1 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 1 | 5 |
| 1 | chr1 | 3 | 8 |
| 2 | chr1 | 8 | 10 |
| 3 | chr1 | 12 | 14 |
df2 = bf.from_any(
[['chr1', 4, 8],
['chr1', 10, 11]],
name_col='chrom'
)
display(df2)
bf.vis.plot_intervals(df2, show_coords=True, xlim=(0,16), colors='lightpink')
plt.title('bedFrame2 intervals');
| chrom | start | end | |
|---|---|---|---|
| 0 | chr1 | 4 | 8 |
| 1 | chr1 | 10 | 11 |
The most common operation is to calculate the overlaps between two sets of genomic intervals.
overlaps = bf.overlap(df1, df2, how='inner', suffixes=('_1','_2'))
overlaps
| chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | chr1 | 4 | 8 |
| 1 | chr1 | 3 | 8 | chr1 | 4 | 8 |
df1 = pd.DataFrame([
['chr1', 1, 5],
['chr1', 3, 8],
['chr1', 8, 10],
['chr1', 12, 14],
],
columns=['chrom', 'start', 'end']
)
bf.vis.plot_intervals(df1, show_coords=True, xlim=(0,16))
df_annotated = bf.cluster(df1, min_dist=0)
display(df_annotated)
bf.vis.plot_intervals(df_annotated, labels=df_annotated['cluster'], show_coords=True, xlim=(0,16))
| chrom | start | end | cluster | cluster_start | cluster_end | |
|---|---|---|---|---|---|---|
| 0 | chr1 | 1 | 5 | 0 | 1 | 10 |
| 1 | chr1 | 3 | 8 | 0 | 1 | 10 |
| 2 | chr1 | 8 | 10 | 0 | 1 | 10 |
| 3 | chr1 | 12 | 14 | 1 | 12 | 14 |
#import hg
#hg.Viewconf.from_url("https://resgen.io/api/v1/viewconfs/Y_omIrpERgG01VsqmtMLVA/?raw=1")
Genomic intervals are one of the most prevalent data structures in computational genome biology, and used to represent features ranging from genes, to DNA binding sites, to disease variants. Operations on genomic intervals provide a language for asking questions about relationships between features.
While there are excellent interval arithmetic tools for the command line, they are not smoothly integrated into Python, one of the most popular general-purpose computational and visualization environments.
Bioframe is a library to enable flexible and performant operations on genomic interval dataframes in Python. Bioframe extends the Python data science stack to use cases for computational genome biology by building directly on top of two of the most commonly-used Python libraries, numpy and pandas.
The bioframe API enables flexible name and column orders, and decouples operations from data formats to avoid unnecessary conversions, a common scourge for bioinformaticians. Bioframe achieves these goals while maintaining high performance and a rich set of features.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import bioframe
Toy examples with plots
Figure from paper
ctcf_peaks = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
schema='narrowPeak'
)
ctcf_peaks
| chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr19 | 48309541 | 48309911 | . | 1000 | . | 5.04924 | -1.0 | 0.00438 | 185 |
| 1 | chr4 | 130563716 | 130564086 | . | 993 | . | 5.05052 | -1.0 | 0.00432 | 185 |
| 2 | chr1 | 200622507 | 200622877 | . | 591 | . | 5.05489 | -1.0 | 0.00400 | 185 |
| 3 | chr5 | 112848447 | 112848817 | . | 869 | . | 5.05841 | -1.0 | 0.00441 | 185 |
| 4 | chr1 | 145960616 | 145960986 | . | 575 | . | 5.05955 | -1.0 | 0.00439 | 185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40582 | chr8 | 22574315 | 22574744 | . | 1000 | . | 561.11939 | -1.0 | 4.90268 | 243 |
| 40583 | chr15 | 56246029 | 56246402 | . | 1000 | . | 569.05663 | -1.0 | 4.90268 | 191 |
| 40584 | chr1 | 150979463 | 150979845 | . | 1000 | . | 580.28482 | -1.0 | 4.90268 | 194 |
| 40585 | chr16 | 57649040 | 57649402 | . | 1000 | . | 602.95266 | -1.0 | 4.90268 | 173 |
| 40586 | chr12 | 54379625 | 54380042 | . | 1000 | . | 627.60723 | -1.0 | 4.90268 | 203 |
40587 rows × 10 columns
jaspar_url = 'http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/'
jaspar_motif_file = 'MA0139.1.tsv.gz'
ctcf_motifs = bioframe.read_table(
jaspar_url + jaspar_motif_file,
schema='jaspar',
skiprows=1
)
ctcf_motifs
| chrom | start | end | name | score | pval | strand | |
|---|---|---|---|---|---|---|---|
| 0 | chr1 | 11163 | 11182 | CTCF | 811 | 406 | - |
| 1 | chr1 | 11222 | 11241 | CTCF | 959 | 804 | - |
| 2 | chr1 | 11280 | 11299 | CTCF | 939 | 728 | - |
| 3 | chr1 | 11339 | 11358 | CTCF | 837 | 455 | - |
| 4 | chr1 | 11401 | 11420 | CTCF | 829 | 439 | - |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 770044 | chrY_KI270740v1_random | 4600 | 4619 | CTCF | 815 | 412 | - |
| 770045 | chrY_KI270740v1_random | 5601 | 5620 | CTCF | 815 | 412 | - |
| 770046 | chrY_KI270740v1_random | 7601 | 7620 | CTCF | 815 | 412 | - |
| 770047 | chrY_KI270740v1_random | 8602 | 8621 | CTCF | 815 | 412 | - |
| 770048 | chrY_KI270740v1_random | 19753 | 19772 | CTCF | 806 | 395 | - |
770049 rows × 7 columns
df = bioframe.overlap(
ctcf_peaks,
ctcf_motifs,
# suffixes=('_1', ''),
return_index=True,
how='left',
)
motifs_per_peak = df.groupby(["index"])["index_"].count().values
bins = np.arange(0, np.max(motifs_per_peak))
counts, _ = np.histogram(motifs_per_peak, bins)
plt.bar(bins[:-1], height=counts, align='center')
plt.xlabel('number of overlapping motifs per peak')
plt.ylabel('number of peaks')
plt.semilogy();
print(f'fraction of peaks without motifs {np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak),2)}');
fraction of peaks without motifs 0.14
df
| index | chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit | index_ | chrom_ | start_ | end_ | name_ | score_ | pval_ | strand_ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | chr19 | 48309541 | 48309911 | . | 1000.0 | . | 5.04924 | -1.0 | 0.00438 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 1 | 1 | chr4 | 130563716 | 130564086 | . | 993.0 | . | 5.05052 | -1.0 | 0.00432 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 2 | 2 | chr1 | 200622507 | 200622877 | . | 591.0 | . | 5.05489 | -1.0 | 0.00400 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 3 | 3 | chr5 | 112848447 | 112848817 | . | 869.0 | . | 5.05841 | -1.0 | 0.00441 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| 4 | 4 | chr1 | 145960616 | 145960986 | . | 575.0 | . | 5.05955 | -1.0 | 0.00439 | 185.0 | <NA> | None | <NA> | <NA> | None | NaN | NaN | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48626 | 40584 | chr1 | 150979463 | 150979845 | . | 1000.0 | . | 580.28482 | -1.0 | 4.90268 | 194.0 | 40407 | chr1 | 150979668 | 150979687 | CTCF | 925.0 | 678.0 | - |
| 48627 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261215 | chr16 | 57649185 | 57649204 | CTCF | 918.0 | 656.0 | - |
| 48628 | 40585 | chr16 | 57649040 | 57649402 | . | 1000.0 | . | 602.95266 | -1.0 | 4.90268 | 173.0 | 261216 | chr16 | 57649229 | 57649248 | CTCF | 829.0 | 439.0 | + |
| 48629 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153951 | chr12 | 54379782 | 54379801 | CTCF | 863.0 | 511.0 | - |
| 48630 | 40586 | chr12 | 54379625 | 54380042 | . | 1000.0 | . | 627.60723 | -1.0 | 4.90268 | 203.0 | 153952 | chr12 | 54379835 | 54379854 | CTCF | 885.0 | 564.0 | - |
48631 rows × 19 columns
%%bash
echo "Hello World"
which sed
Hello World /bin/sed
import hg
config = hg.Viewconf.from_url('https://higlass.io/api/v1/viewconf?d=default')
config